Before we begin, please make sure you have all the necessary packages installed and loaded according to the list in the README and as outlined in the code below.
#load these librarieslibrary(here)
here() starts at /Users/sethmetcalfe/PSY6422_250134696_Project
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)library(naniar)library(sf)
Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
For this project, I used a data set from the UK Department of Health and Social Care called “Alcohol Profile 2025.” You can access this data set here. To download the correct data set, scroll down on the linked webpage to the “Get the data” section. There are three subheadings, the first of which is “Profile: Alcohol Profile.” Under that are two hyper-linked data sets. The correct data set to download is the first one, which is called “Data for Government Office Region (E12).” Because I am using the “here” package, I saved the downloaded data to my project in a new subfolder called “Data.”
Note: there are three hyper-links called “Data for Government Office Region (E12).” We only need the one from the “Profile: Alcohol Profile” heading.
Note: if you are accessing this data organically from a web search rather than through the provided link, be sure to change the “Area type” tab to “Regions (Statistical)” to ensure you download the data set that includes statistics across the regions of England.
Loading the Data
We will be loading the using the “here” package. Make sure you have it downloaded and loaded for this project. I saved my raw data to a subfolder called “Data.” If you have done something different, edit the following code to match your chosen directory.
#loading in the data set and assigning it the name "raw" and making it a tibbleraw <-read_csv(here("Data", "indicators-Regionsstatistical.data.csv"))
Rows: 34027 Columns: 27
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): Indicator Name, Parent Code, Parent Name, Area Code, Area Name, Ar...
dbl (9): Indicator ID, Value, Lower CI 95.0 limit, Upper CI 95.0 limit, Low...
lgl (2): New data, Compared to goal
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
raw_tib <-as_tibble(raw)
After loading the data, I immediately transformed the data into a tibble to make the following tidying and filtering of the data easier for R Studio to read.
Viewing and Understanding the Data
The data was collected by the Department of Health and Social Care as part of an initiative by the Office for Health Improvement and Disparities to track different alcohol-related indicators in the United Kingdom. There are 24 different indicators tracked in this profile, including such things as alcohol-related mortality, under 75 mortality rate from alcoholic liver disease, and admission episodes for alcohol-specific conditions. This data has been collected from public health services such as hospitals and health centres across the UK. For this project, we are specifically looking at the indicator of “admission episodes for alcohol-specific conditions.” Let’s take a look at the beginning of the raw data.
#getting a first look at the raw datahead(raw_tib)
# A tibble: 6 × 27
`Indicator ID` `Indicator Name` `Parent Code` `Parent Name` `Area Code`
<dbl> <chr> <chr> <chr> <chr>
1 90861 Under 75 mortality rat… <NA> <NA> E92000001
2 90861 Under 75 mortality rat… <NA> <NA> E92000001
3 90861 Under 75 mortality rat… <NA> <NA> E92000001
4 90861 Under 75 mortality rat… E92000001 England E92000001
5 90861 Under 75 mortality rat… E92000001 England E92000001
6 90861 Under 75 mortality rat… E92000001 England E92000001
# ℹ 22 more variables: `Area Name` <chr>, `Area Type` <chr>, Sex <chr>,
# Age <chr>, `Category Type` <chr>, Category <chr>, `Time period` <chr>,
# Value <dbl>, `Lower CI 95.0 limit` <dbl>, `Upper CI 95.0 limit` <dbl>,
# `Lower CI 99.8 limit` <dbl>, `Upper CI 99.8 limit` <dbl>, Count <dbl>,
# Denominator <dbl>, `Value note` <chr>, `Recent Trend` <chr>,
# `Compared to England value or percentiles` <chr>,
# `Compared to percentiles` <chr>, `Time period Sortable` <dbl>, …
The data has 27 columns of variables, most of which are not pertinent for creating my visualization. Immediately, you can see indicator data including the names of the indicators and their respective ID numbers. We can also see region data under the “Area Name” and “Area Code” columns. There is also information for age and sex of patients. Probably the most important variable is the “Value” column, which gives information pertinent to each indicator. In the case of my project, the Value variable is telling us how many Alcohol-Specific Admissions occurred per 100,000 people in the given years and regions.
The Shapefile Data to Build the Map
In order to build the map itself, we need a shapefile of England’s nine regions (which include the North East, North West, Yorkshire and the Humber, East Midlands, West Midlands, East of England, London, South East, and South West). I obtained this shapefile from the Open Geography Portal by the Office of National Statistics. The shapefile called “Regions (December 2023) Boundaries EN BGC.” You can download the shapefile here. On this linked webpage, hit download on the left-hand side, and when you do, ten download options will be available. Choose the second option called “Shapefile.” Because I am using the “here” package, I saved the downloaded data to my project in the “Data” subfolder, like before.
Note: this download will need to be unzipped in order to access it.
Load the Shapefile using the code below.
#loading the map datamap <-read_sf(here("Data", "Regions_December_2023_Boundaries_EN_BGC_7009286823150963379"))
Research Question
My visualization is attempting to answer the following question: how have the numbers of alcohol-specific hospital admissions changed in each region of England since the year 2016? I suppose this is a very specific question, but hopefully through this visualization, it will be clear if the different regions of England are getting better or worse when it comes to their problematic alcohol use. I will urge viewers not to misunderstand or misrepresent the data. This is a measure of change, not a comparative measure of admissions in total. You might notice that the North West region is getting better, while the West Midlands is getting worse. However, in total, the North West region has roughly 150 more admissions per 100,000 than the West Midlands, which means that their drinking behavior is more problematic than the West Midlands according to total admissions. The graph is showcasing positive and negative trends in alcohol consumption, not comparing which regions are healthier (in a manner of speaking) than others.
Data Preparation
After loading all the data, we need to clean it! To start, I checked if there are any duplicated items in the data with the following code. Thankfully, none were detected.
#checking for duplicated data, no duplications were foundduplicated(raw_tib)
After that, I ran a few pieces of code to make the variable names cleaner. Primarily, that meant removing spaces in the variables of interest and their respective column data observations. The following code replaces any spaces with underscores for four of the variables of interest and all observations.
#renaming relevant variables to remove spacesnames(raw_tib)[names(raw_tib) =="Indicator ID"] <-"Indicator_ID"names(raw_tib)[names(raw_tib) =="Area Name"] <-"Area_Name"names(raw_tib)[names(raw_tib) =="Time period"] <-"Time_Period"names(raw_tib)[names(raw_tib) =="Area Code"] <-"Area_Code"#replacing spaces in the data with underscoresraw_cleaned <- raw_tib %>%mutate_if(is.character, str_replace_all, ' ', '_')
Next, I went through and selected only the variables and observations from the raw data that I needed for my visualization. I called this new data frame “df”, and those variables included Indicator_ID, Area_Name, Time_Period, and Value. Within them, I needed to make sure I was only getting region specific information, so I filtered out the area code for England as a whole. I also filtered out all the indicators that I wasn’t using and made sure that we were using sex data from all persons, not just males or females.
#filtering and selecting the relevant datadf <- raw_cleaned %>%filter(Indicator_ID ==92906) %>%filter(Area_Code !="E92000001") %>%filter(Sex =="Persons") %>%select(Indicator_ID, Area_Name, Time_Period, Value)
Using the nanair package, I also removed any missing pieces of data.
#removing missing datadf <-drop_na(df)
The last two steps in cleaning the overall data were to round the observations in the Value column to the nearest whole number at to filter for only the correct indicator. I rounded the variables because the Department of Health and Social Care did the same when they created their own visualizations of their data. Why should I decide to do otherwise? The filtering step may seem unnecessary, but I have kept it primarily as a renaming function which was necessary for my following code. (Originally, I was using two indicators so two names were needed. As changes were made, I kept the naming conventions to reduce the risk of personal mistakes.)
#rounding values to nearest whole number #(I made this decision only because the original data source did the same)df <- df %>%mutate_if(is.numeric, round)#filtering further to separate 92906 data from the rest of the tibbledf_all <- df %>%filter(Indicator_ID ==92906)
Filtering, Filtering, Filtering
The last step of cleaning the data was creating 9 different data frames from the combined data from for the 9 regions. This is a bit arduous, but it ended up being necessary in order for me to create disparately colored regions in the final visualization.
#list of filters for the region types of all admissiondf_all_NE <- df_all %>%filter(Area_Name =="North_East_region_(statistical)")df_all_NW <- df_all %>%filter(Area_Name =="North_West_region_(statistical)")df_all_York <- df_all %>%filter(Area_Name =="Yorkshire_and_the_Humber_region_(statistical)")df_all_EM <- df_all %>%filter(Area_Name =="East_Midlands_region_(statistical)")df_all_WM <- df_all %>%filter(Area_Name =="West_Midlands_region_(statistical)")df_all_EE <- df_all %>%filter(Area_Name =="East_of_England_region_(statistical)")df_all_Lon <- df_all %>%filter(Area_Name =="London_region_(statistical)")df_all_SE <- df_all %>%filter(Area_Name =="South_East_region_(statistical)")df_all_SW <- df_all %>%filter(Area_Name =="South_West_region_(statistical)")
Similarly, I had to separate out the data from the Shapefile for the 9 regions.
Before building the final visualization, I first needed to create the subplots that would appear when you hover over the regions of the map. Before doing that though, I needed to build a save location to put the subplots in order to save them as tooltip items. I did this with the following code.
#The goal of this section is to set a file path to save the subplots in a tooltip#to pop them up when we hover over them in the final map plotplot_to_svg_base64 <-function(plot_obj, width =4, height =3) { # Create a temporary file path tmp_file <-tempfile(fileext =".svg")# Save the plot object to the temporary SVG fileggsave( tmp_file, plot = plot_obj, device = svglite, width = width, height = height, units ="in" )# Read the SVG content svg_content <-readChar(tmp_file, file.info(tmp_file)$size)# Clean up the temporary fileunlink(tmp_file)# Encode the SVG content to Base64 to embed directly in HTML tooltip base64_string <-base64encode(charToRaw(svg_content))# Return the full data URIreturn(paste0("data:image/svg+xml;base64,", base64_string))}
Once that was set, I could build 9 subplots for the 9 regions ussing ggplot2:
#plot of the time series of all admissions in the North East Regionts_all_NE <- df_all_NE %>%ggplot(aes(x = Time_Period, y = Value)) +geom_point(color ="black") +geom_line(aes(group ="Area_Name"), color ="darkred") +ylim(600, 1000) +labs(title ="Admissions in the North East", x ="Years", y ="Number of Admissions (per 100,000)") +theme_classic() +theme(plot.title =element_text(size =10, face ="bold"), axis.title =element_text(size =8), axis.text =element_text(size =7, angle =45, hjust =1, vjust =1))ne_plot_svg <-plot_to_svg_base64(ts_all_NE, width =3, height =2.5)#plot of the time series of all admissions in the North West Regionts_all_NW <- df_all_NW %>%ggplot(aes(x = Time_Period, y = Value)) +geom_point(color ="black") +geom_line(aes(group ="Area_Name"), color ="skyblue") +ylim(700, 900) +labs(title ="Admissions in the North West", x ="Years", y ="Number of Admissions (per 100,000)") +theme_classic() +theme(plot.title =element_text(size =10, face ="bold"), axis.title =element_text(size =8), axis.text =element_text(size =7, angle =45, hjust =1, vjust =1))nw_plot_svg <-plot_to_svg_base64(ts_all_NW, width =3, height =2.5)#plot of the time series of all admissions in the Yorkshire Regionts_all_York <- df_all_York %>%ggplot(aes(x = Time_Period, y = Value)) +geom_point(color ="black") +geom_line(aes(group ="Area_Name"), color ="orange") +ylim(400, 800) +labs(title ="Admissions in Yorkshire \nand the Humber", x ="Years", y ="Number of Admissions (per 100,000)") +theme_classic() +theme(plot.title =element_text(size =10, face ="bold"), axis.title =element_text(size =8), axis.text =element_text(size =7, angle =45, hjust =1, vjust =1))york_plot_svg <-plot_to_svg_base64(ts_all_York, width =3, height =2.5)#plot of the time series of all admissions in the East Midland Regionts_all_EM <- df_all_EM %>%ggplot(aes(x = Time_Period, y = Value)) +geom_point(color ="black") +geom_line(aes(group ="Area_Name"), color ="orange") +ylim(400, 700) +labs(title ="Admissions in the East Midlands", x ="Years", y ="Number of Admissions (per 100,000)") +theme_classic() +theme(plot.title =element_text(size =10, face ="bold"), axis.title =element_text(size =8), axis.text =element_text(size =7, angle =45, hjust =1, vjust =1))em_plot_svg <-plot_to_svg_base64(ts_all_EM, width =3, height =2.5)#plot of the time series of all admissions in the West Midland Regionts_all_WM <- df_all_WM %>%ggplot(aes(x = Time_Period, y = Value)) +geom_point(color ="black") +geom_line(aes(group ="Area_Name"), color ="darkred") +ylim(400, 700) +labs(title ="Admissions in the West Midlands", x ="Years", y ="Number of Admissions (per 100,000)") +theme_classic() +theme(plot.title =element_text(size =10, face ="bold"), axis.title =element_text(size =8), axis.text =element_text(size =7, angle =45, hjust =1, vjust =1))wm_plot_svg <-plot_to_svg_base64(ts_all_WM, width =3, height =2.5)#plot of the time series of all admissions in the East of England Regionts_all_EE <- df_all_EE %>%ggplot(aes(x = Time_Period, y = Value)) +geom_point(color ="black") +geom_line(aes(group ="Area_Name"), color ="orange") +ylim(300, 600) +labs(title ="Admissions in the East of England", x ="Years", y ="Number of Admissions (per 100,000)") +theme_classic() +theme(plot.title =element_text(size =10, face ="bold"), axis.title =element_text(size =8), axis.text =element_text(size =7, angle =45, hjust =1, vjust =1))ee_plot_svg <-plot_to_svg_base64(ts_all_EE, width =3, height =2.5)#plot of the time series of all admissions in the London Regionts_all_Lon <- df_all_Lon %>%ggplot(aes(x = Time_Period, y = Value)) +geom_point(color ="black") +geom_line(aes(group ="Area_Name"), color ="orange") +ylim(400, 700) +labs(title ="Admissions in the London Region", x ="Years", y ="Number of Admissions (per 100,000)") +theme_classic() +theme(plot.title =element_text(size =10, face ="bold"), axis.title =element_text(size =8), axis.text =element_text(size =7, angle =45, hjust =1, vjust =1))lon_plot_svg <-plot_to_svg_base64(ts_all_Lon, width =3, height =2.5)#plot of the time series of all admissions in the South East Regionts_all_SE <- df_all_SE %>%ggplot(aes(x = Time_Period, y = Value)) +geom_point(color ="black") +geom_line(aes(group ="Area_Name"), color ="skyblue") +labs(title ="Admissions in the South East", x ="Years", y ="Number of Admissions (per 100,000)") +ylim(100, 800) +theme_classic() +theme(plot.title =element_text(size =10, face ="bold"), axis.title =element_text(size =8), axis.text =element_text(size =7, angle =45, hjust =1, vjust =1))se_plot_svg <-plot_to_svg_base64(ts_all_SE, width =3, height =2.5)#plot of the time series of all admissions in the South West Regionts_all_SW <- df_all_SW %>%ggplot(aes(x = Time_Period, y = Value)) +geom_point(color ="black") +geom_line(aes(group ="Area_Name"), color ="orange") +labs(title ="Admissions in the South West", x ="Years", y ="Number of Admissions (per 100,000)") +ylim(400, 800) +theme_classic() +theme(plot.title =element_text(size =10, face ="bold"), axis.title =element_text(size =8), axis.text =element_text(size =7, angle =45, hjust =1, vjust =1))sw_plot_svg <-plot_to_svg_base64(ts_all_SW, width =3, height =2.5)
Finally, we could build the overall visualization. As a quick precursor, I set a vector to contain the colors I wanted for each region.
#begin by setting a vector of the colors used in the legendcolors_legend <-c("Increasing"="darkred", "Decreasing"="skyblue", "No Significant Change"="orange")